ABSTRACT

In  this  paper,  baseline  speech  recognition  performance  is  determined 
both  for  a  single  remote  microphone  and  for  a  signal  derived  from  a 
delay-and-sum  beamformer  using  an  eight-microphone  linear  array. 
An  HMM-based,  connected-speech,  38-word  vocabulary  (alphabet, 
digits,  ‘space’,  ‘period’),  talker-independent  speech  recognition 
system  is  used  for  testing  performance.  Normal  performance,  with 
no  language  model,  i.e.,  raw  word-level  performance,  is  currently 
about  81%  for  a  set  of  talkers  not  in  the  training  set  and  about 
91%  for  training  set  data.  The  system  has  been  trained  and 
tested  using  a  close-talking  head-mounted  microphone.  Since  a 
meaningful  comparison  requires  using  the  same  speech,  the  existing 
speech  database  was  appropriately  pre-filtered,  played  out  through 
a transducer (speaker) in the room environment, picked up by the
microphone array, and re-stored as a digital file. The resulting
file was post-processed and used as input to the recognizer, so that the
recognition performance indicates the effect of the input device. The
baseline  experiment  showed  that  both  a  single  remote  microphone 
and  the  beamformed  signal  reduced  performance  by  12%  in  a  room 
with  no  other  talkers.  For  the  array  tested,  the  error  is  generally 
attributable  to  reverberation  off  the  floor  and  ceiling. 

1.  Introduction 

It is widely accepted that appropriate data-acquisition technology
must be available in order to make speech-recognition
a viable computer input mode [1, 2, 3]. While work has
been done in the area of signal conditioning [4], for the
last three years, research at Brown University has been in
progress to develop hardware, software and algorithms as a
means to make non-intrusive speech acquisition a practical
reality [5, 6]. The principal focus to date has been to use the
phase relationships among a group of microphones spaced
in a line - hence a linear array - for the remote, real-time
acquisition of a talker’s data. Various beamforming and talker
location/tracking algorithms have been studied, reported, and
evaluated relative to listening quality [7, 8, 9, 10, 11, 12].

The  quality  of  a  speech  data  acquisition  system  may  be 
assessed  in  several  ways.  For  many  applications,  evaluation 
is  usually  given,  quantitatively,  in  terms  of  some  signal- 
to-noise  measure  or  human-listening  experiment  score,  or 
qualitatively  in  terms  of  human  evaluation.  However,  for 
a  system  whose  output  is  fed  to  a  speech  recognizer,  the 
recognition performance is an excellent, quantifiable measure;
this  approach  and  its  results  make  up  the  body  of  this  paper. 

A key problem for such systems to overcome is that of reverberation.
Acoustic reflections in a normal room environment
make the output of a remote microphone quite different from
that taken from the normal, close-talking, recognizer microphone.
Several ways have been suggested to alleviate this
problem:

• A more focused array system will attenuate reflections
coming from a wider off-axis volume [13]. Many microphones
are required to do this, and a system with
beam-width control over a broad spectrum and in two or
three directions is essential. This is the spatial-filtering
approach to solving the problem.

•  The  acoustic  environment  near  the  microphones  is  very 
critical.  New  ways  of  mounting  the  microphones  in 
an  appropriately  sound  absorbent  material  substantially 
improve  performance,  without  necessarily  limiting  the 
practicality  of  the  array.  More  directional  elements  can 
also  be  used.  This  is  an  acoustical  approach  to  helping 
to  resolve  the  problem. 

• One form or another of deconvolution can be used
to undo the effects of reverberations [3, 14, 15, 16,
17, 18, 19]. Either directly or indirectly, some characterization
of the room is obtained, usually as some
spatially-dependent impulse response. After this nontrivial
problem is solved, some processing “art” is often
essential to overcome nulls in the spectrum and perform
inverse filtering.

This  project  investigates  all  of  the  above  methods.  It  might 
be  added  that,  when  working  with  real  acoustic  systems, 
mechanisms  for  reducing  reverberations  must  be  carefully 
applied;  it  is  a  hard  problem.  However,  the  purpose  of 
this paper is not to deal with the improvements achieved by
employing various means to dereverberate the output signal of
the array; rather, it is to set a baseline standard against which
to compare future developments. The problem is posed: how
badly does recognizer performance degrade when the input
signal is from 1) a single remote omnidirectional microphone,
or from 2) the beamformed output from a linear microphone
array?  This  experiment  quantifies  the  acceptability  (or  lack 
thereof)  of  using  relatively  straightforward  implementations 
of  remote  microphone  technology  for  speech  recognition. 

2.  The  LEMS  Speech  Recognizer 

An HMM-based, connected-speech, 38-word vocabulary (alphabet,
digits, ‘space’, ‘period’), talker-independent speech
recognition system has been running for two years in the
LEMS facility [20, 21]. This small, but very difficult vocabulary
has many of the problems associated with a phoneme
recognizer.

Speech, sampled at 16kHz from a close-talking microphone,
is framed with a 40ms Hamming window every 10ms.
Twelve cepstral coefficients, twelve delta cepstral coefficients,
overall energy and delta overall energy comprise the
26-element feature vector. Three 256-entry codebooks are
used to vector quantize the data from cepstral, delta cepstral,
and energy/delta energy features respectively^1. The recognizer
differs from standard HMM models in that durational
probabilities are handled explicitly [22]. For each state, self
transitions are disallowed. During training, nonparametric
statistics are gathered for each of 30 potential durations in
the state, i.e., 10ms to 300ms. In the base system used for
this experiment, a gamma distribution was fitted to the nonparametric
statistics. The models used are word-level models
having from five to twelve states. Only forward transitions
and skips of a single state were allowed.
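
The framing arithmetic described above can be sketched as follows. This is only an illustration of the stated 40ms window, 10ms step and 26-element vector layout; the function names and the 440Hz test tone are hypothetical, and the cepstral analysis itself is not shown because the text does not specify how the cepstra were computed.

```python
import math

SAMPLE_RATE_HZ = 16_000   # from the text
WINDOW_MS = 40            # from the text
HOP_MS = 10               # from the text

WINDOW_LEN = SAMPLE_RATE_HZ * WINDOW_MS // 1000   # samples per frame
HOP_LEN = SAMPLE_RATE_HZ * HOP_MS // 1000         # samples between frame starts


def hamming(n):
    """Return an n-point Hamming window."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frames(samples):
    """Yield Hamming-windowed 40 ms frames taken every 10 ms."""
    window = hamming(WINDOW_LEN)
    for start in range(0, len(samples) - WINDOW_LEN + 1, HOP_LEN):
        yield [s * w for s, w in zip(samples[start:start + WINDOW_LEN], window)]


def feature_dimension(n_cepstra=12):
    """Cepstra + delta cepstra + energy + delta energy."""
    return n_cepstra + n_cepstra + 1 + 1


# Hypothetical input: one second of a 440 Hz tone.
one_second = [math.sin(2.0 * math.pi * 440.0 * t / SAMPLE_RATE_HZ)
              for t in range(SAMPLE_RATE_HZ)]
n_frames = sum(1 for _ in frames(one_second))
print(WINDOW_LEN, HOP_LEN, n_frames, feature_dimension())
```

Each frame overlaps its neighbor by 30ms, so one second of speech yields just under 100 feature vectors.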

The  best  available  recognizer  at  the  time  was  used  for  the 
experiment,  except  that  the  amount  of  speech  normally  used 
to  develop  the  vector  quantization  codebooks  was  reduced 
from  one  and  one-half  hours  to  15  minutes.  This  made  it 
feasible  to  do  several  full  k-means  re-trainings  of  the  system; 
VQ  training  took  but  two  days  (elapsed  time)  on  a  SUN 
SPARCstation  2  while  VQ  training  for  the  one  and  one-half 
hour case would have taken an unacceptable twelve days^2!
The  change  to  the  VQ  training  degraded  performance  for 
the  close-talking  microphone  data  by  1.5%,  i.e.,  the  79% 
performance  of  the  system  for  1)  new  talkers  and  2)  no 
grammar  was  reduced  to  77.5%. 

^1 At the time this experiment was initiated, semi-continuous modeling
of output probabilities and better word models were not yet a part of the
system. Current improvements have increased overall performance for the
head-mounted microphone input by about 3%.

^2 We are optimizing this program now.

Figure 1: Acoustical Geometry of Array/Sources

About four hours of speech (2400 connected strings, or
nearly 40,000 vocabulary items) from 80 talkers, half male,
half female, were used to train the hidden Markov models.
Currently, the training procedure requires 60 hours of CPU
time from each of eight SPARC 1+/2 workstations linked
in a loosely-coupled fashion through sockets. Well-known
mechanisms for speeding up the process, such as doing the
computation in the logarithm domain using integers and a
lookup table [23], as well as some detailed new programming
speedups [24], are being used to reduce the training time.
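
The log-domain speedup mentioned in the training description can be illustrated with a minimal sketch. It is not the implementation of [23]; the scale factor, table construction and function names are assumptions chosen for illustration. Probabilities are stored as scaled integer logarithms, products become integer additions, and sums use a precomputed correction table indexed by the difference of the two logs.

```python
import math

SCALE = 1000  # hypothetical resolution: integer units per natural-log unit

# LOG_ADD_TABLE[d] = SCALE * ln(1 + exp(-d / SCALE)), rounded to an integer.
# The table stops where the correction rounds to zero.
LOG_ADD_TABLE = []
d = 0
while True:
    correction = round(SCALE * math.log1p(math.exp(-d / SCALE)))
    if correction == 0:
        break
    LOG_ADD_TABLE.append(correction)
    d += 1


def to_log(p):
    """Convert a probability to its scaled integer logarithm."""
    return round(SCALE * math.log(p))


def log_mul(a, b):
    """Product of two probabilities: add the integer logs."""
    return a + b


def log_add(a, b):
    """Sum of two probabilities: larger log plus a table-lookup correction."""
    hi, lo = (a, b) if a >= b else (b, a)
    diff = hi - lo
    if diff >= len(LOG_ADD_TABLE):
        return hi  # the smaller term is negligible at this resolution
    return hi + LOG_ADD_TABLE[diff]


print(log_add(to_log(0.3), to_log(0.1)), to_log(0.4))
print(log_mul(to_log(0.5), to_log(0.5)), to_log(0.25))
```

Inside the inner loops only integer additions, a comparison and one table lookup remain, which is the source of the speedup.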

3.  Data  Development 

The  original  speech  data  were  recorded  in  a  large,  generally 
not-too-noisy  room  through  an  Audio  Technica  ATM73a 
head-mounted,  close-talking  microphone.  The  speech  was 
sampled through a Sony DAT (16 bits at a 48kHz sampling
rate). It was then digitally decimated to 16kHz and fed directly
to  a  SUN  workstation  to  build  a  high-fidelity  database  [25]. 
The  signal-to-noise  ratio  is  about  50dB. 

It  would  not  have  been  possible,  let  alone  feasible,  to  record 
another large dataset from the same talkers using the microphone
array system for acquisition. Thus, a mechanism had
to  be  developed  to  use  the  high-fidelity  database  as  input  to 
the  array  recording  system.  A  high-quality  transducer  was 
used  to  play  out  the  speech;  the  geometry  is  shown  in  Figure 
1.  The  resulting  real-time  system  for  the  data  conversion 
is schematically shown in Figure 2. Three SPARC 1+/2
workstations  are  used.  The  first  converts  the  digital  speech 
data  in  speech  recognition  format  into  digital  data  acceptable 
for  playback  through  the  microphone  array  hardware.  This 
involves  changing  the  sampling  rate  from  16kHz  to  20kHz 
and then applying an FIR inverse filter to undo the coloring
that  will  come  from  the  output  transducer.  This  filter  was 
obtained by running digital, band-limited white noise with
DFT spectrum W(r) through the transducer and recording
the output through an ultra-flat frequency response Brüel
& Kjær (B&K) condenser microphone system placed a few
centimeters in front of the middle of the output transducer.
After accumulating an average magnitude spectrum of the
B&K’s output via multiple 128-point DFT’s, the spectrum
S(r) was inverted, i.e., Y(r) = W(r)/S(r), and inverse
transformed to produce a zero-phase FIR filter^3. At frequencies
attenuated by the anti-aliasing filter, i.e., above 7kHz,
the inverse filter was forced to unity gain. S(r) and Y(r) are
shown in Figure 3. The subjective audible effects as well
as the flattened white-noise response indicate that this procedure
was successful in removing the ‘boominess’ potentially
introduced by the transducer system.

Figure 2: The Data Conversion System

Figure 3: Spectra of the Output Transducer System and Inverse Filter
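
The inverse-filter construction Y(r) = W(r)/S(r) can be sketched as below. The DFT length, sampling rate and 7kHz limit come from the text; the measured spectrum used here is a made-up low-frequency boost standing in for the transducer coloring, the white-noise spectrum is assumed flat, and the function names are hypothetical.

```python
import math

N_DFT = 128              # DFT length from the text
SAMPLE_RATE_HZ = 20_000  # playback sampling rate from the text
CUTOFF_HZ = 7_000        # above this the gain is forced to unity


def inverse_filter_spectrum(w_mag, s_mag):
    """Y(r) = W(r)/S(r) up to the cutoff, unity gain above it (bins 0..N/2)."""
    y = []
    for r, (w, s) in enumerate(zip(w_mag, s_mag)):
        freq = r * SAMPLE_RATE_HZ / N_DFT
        y.append(w / s if freq <= CUTOFF_HZ else 1.0)
    return y


def zero_phase_fir(y_half):
    """Inverse DFT of a real, even spectrum: a real, even (zero-phase) response."""
    full = y_half + y_half[-2:0:-1]   # mirror bins N/2-1 .. 1 into the upper half
    return [sum(full[r] * math.cos(2.0 * math.pi * r * n / N_DFT)
                for r in range(N_DFT)) / N_DFT
            for n in range(N_DFT)]


# Hypothetical spectra: flat white noise and a 'boomy' transducer response.
w_mag = [1.0] * (N_DFT // 2 + 1)
s_mag = [1.0 + 0.5 * math.exp(-r / 8.0) for r in range(N_DFT // 2 + 1)]

y_mag = inverse_filter_spectrum(w_mag, s_mag)
h = zero_phase_fir(y_mag)
print(len(h), h[0])
```

Because the spectrum is real and symmetric, the resulting taps satisfy h[n] = h[N-n], which is what makes the filter zero-phase; a causal version would need a circular shift and therefore a fixed delay.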

Initially, small, omnidirectional electret microphones were
mounted at the edge of a 5cm x 10cm board containing
amplifier/filter electronics and the board was plugged vertically
into a (2.5m) horizontal cage. Recent work disclosed that
this system formed resonant cavities that impacted the
performance of the linear microphone array. When the same
microphones with the same spacing (18.4 cm) were inserted
into a (180cm x 30cm x 15cm) block of six-pound ester foam,
the degradations due to the cavities disappeared, as may be
seen in Figure 4. Note that the data shown are for the
transducer output after the noise has been inverse filtered.

The remainder of the data conversion system is straightforward.
Twenty kilohertz sampling interrupts are used both
to produce the speech output(s) and to digitize the analog
signals from the eight microphones. Sufficient memory is
available for utterances of about 10 seconds. Upon completion
of an utterance, the microphone data are sent to a third
SPARCstation for sample-rate conversion, signal processing
for recognition, and archiving on hard disk as feature vectors
for the recognition system.

^3 Non-zero-phase inverse filters are also being investigated.

4.  Experiment  and  Results 

The  system  was  trained,  both  for  VQ  and  for  the  hidden 
Markov  model  parameters,  three  different  times:  1)  for  the 
high-fidelity  data,  2)  for  the  output  of  a  single  microphone 
of  the  array  (a  central  one),  and  3)  for  the  simple  delay-and- 
sum beamformed output of the eight-microphone array. The
recognizer  was  tested  using  20  new  talkers,  again  half  male 
and  half  female,  for  a  total  of  an  hour  of  speech,  or  about  4800 
vocabulary  items.  The  data  conversion  system  was  run  under 
‘quiet’  conditions.  Not  including  noise  due  to  reverberations, 
the  signal-to-noise  ratios  were  significantly  degraded  by  the 
acoustical  noise  to  24dB  for  the  single  remote  microphone 
and  26dB  for  the  beamformed  signal.  The  results  as  a  function 
of  talker  number  are  plotted  in  Figure  5.  From  the  Figure, 
one  may  deduce  that: 

•  For  all  cases,  variation  with  respect  to  talker  is  far  greater 
than  variations  due  to  other  effects. 

•  Recognition  performance  is  approximately  the  same  for 
the  single  microphone  as  it  is  for  the  beamformed  case, 
given no other point ‘noise’ sources.

•  Performance  for  the  high-fidelity  signal  is  consistently 
about  12%  better  than  for  the  acoustically  degraded 
signal. 

Figure 4: Spectra of Array Microphone Response in Cage and Foam

For  completeness,  each  of  the  test  datasets  was  run  against 
each  of  the  three  systems.  The  results  are  given  in  Table  1. 


                        Model Trained from
Test Data from    Hi-Fi    Remote Mike    Beamformed
Hi-Fi             77.5%    50.0%          53.6%
Remote Mike       38.8%    65.6%          64.6%
Beamformed        32.8%    57.6%          65.3%

Table 1: Averaged Results for Direct and Cross-Trained Systems
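
The "about 12%" figure quoted for the matched conditions follows directly from the diagonal of Table 1. The short sketch below only restates that arithmetic with the published values; the variable names are illustrative.

```python
# Rows: source of the test data; columns: source of the training data (Table 1).
SOURCES = ["Hi-Fi", "Remote Mike", "Beamformed"]
ACCURACY = {
    "Hi-Fi":       {"Hi-Fi": 77.5, "Remote Mike": 50.0, "Beamformed": 53.6},
    "Remote Mike": {"Hi-Fi": 38.8, "Remote Mike": 65.6, "Beamformed": 64.6},
    "Beamformed":  {"Hi-Fi": 32.8, "Remote Mike": 57.6, "Beamformed": 65.3},
}

# Matched condition: the recognizer is trained and tested on the same input type.
matched = {s: ACCURACY[s][s] for s in SOURCES}
loss_vs_hifi = {s: round(matched["Hi-Fi"] - matched[s], 1) for s in SOURCES}

for s in SOURCES:
    print(f"{s:12s} matched {matched[s]:.1f}%  loss vs Hi-Fi {loss_vs_hifi[s]:.1f} points")
```

The off-diagonal entries show the additional cost of a mismatch, for example a Hi-Fi-trained model scoring 38.8% on remote-microphone test data.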


5.  Discussion 

Given the degraded acoustical environment, it was not surprising
that performance for the converted data was reduced
using remote-microphone input. However, it was somewhat
surprising that this very carefully done experiment indicates
no performance advantage when simple beamforming is used
to generate the input. This could be due to the following:

• Low-frequency background noise is not effectively eliminated
by an acoustic array of this type and size. Some
filtering, perhaps combined with sub-band types of enhancements,
should help.

• The major reverberations in the room come from the
ceiling and floor. They have been measured as being
as much as 25% of the original wavefront in intensity.
Even if the reflections average 10%, implying a 14dB
signal-to-noise ratio, ‘quiet’ room conditions no longer
hold. A focused two- or three-dimensional array could
attenuate these reflections and thus address the problem.
Alternatively, pressure gradient microphones could be
used in a one-dimensional array as done in [13].

• There is always some variability in an acoustical experiment
regarding equipment positioning, overall amplitudes,
microphone calibration etc. While great care was
taken, certainly the beamformer output would be more
susceptible to these variabilities than would be the single
remote microphone.

Figure 5: Recognition Performance for the Three Acquisition Systems

In  order  to  determine  the  impact  of  beamforming,  the  testing 
data  were  run  through  the  data  conversion  system  (source  at 
(1m, 2m)) several additional times, each with a second transducer
located at (2m, 2m). This second transducer repeated
a  few  seconds  of  speech  at  various,  controlled  levels  as  the 
testing  data  were  being  recorded.  This  procedure  permits 
the  assessment  of  the  effects  of  beamforming  with  respect  to 
spatial  filtering  of  off-axis  noise.  The  test  datasets  for  both 
the single remote microphone and the beamformed data were
run  through  their  respective  quiet-room  recognizers.  As  the 
purpose  of  this  test  was  to  check  the  simple  beamformer,  more 
elaborate  beamformers  were  not  used  to  generate  the  data  of 
Figure  6.  Also,  note  that  no  background  noise  processing 
(such  as  high-pass  filtering  the  signals)  was  used  to  remove 
the  low-frequency  ‘rumble’  of  the  room. 
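
A simple delay-and-sum beamformer of the kind tested here can be sketched as follows. The 18.4cm spacing, eight microphones, 20kHz sampling rate and the two source positions come from the text; the placement of the array along the x axis with its first element at the origin, the speed of sound, the whole-sample delays and the synthetic signals are all assumptions made for illustration, and reverberation is ignored.

```python
import math
import random

SPEED_OF_SOUND = 343.0     # m/s, assumed room-temperature value
SAMPLE_RATE_HZ = 20_000    # sampling rate of the array hardware
SPACING_M = 0.184          # microphone spacing from the text
N_MICS = 8

# Assumed geometry: microphones along the x axis, first element at the origin.
MIC_POSITIONS = [(i * SPACING_M, 0.0) for i in range(N_MICS)]


def delays_in_samples(source):
    """Propagation delay from a point source to each microphone, in whole samples."""
    return [round(math.dist(source, mic) / SPEED_OF_SOUND * SAMPLE_RATE_HZ)
            for mic in MIC_POSITIONS]


def simulate(signal, source):
    """Each microphone receives a delayed copy (no attenuation, no reflections)."""
    return [[0.0] * d + signal for d in delays_in_samples(source)]


def delay_and_sum(channels, focus):
    """Advance each channel by its steering delay toward `focus`, then average."""
    steer = delays_in_samples(focus)
    length = min(len(ch) - d for ch, d in zip(channels, steer))
    return [sum(ch[d + n] for ch, d in zip(channels, steer)) / len(channels)
            for n in range(length)]


def power(x):
    return sum(v * v for v in x) / len(x)


random.seed(0)
target = (1.0, 2.0)        # primary source position from the text
interferer = (2.0, 2.0)    # second transducer position from the text
tone = [math.sin(2.0 * math.pi * 500.0 * n / SAMPLE_RATE_HZ) for n in range(2000)]
noise = [random.gauss(0.0, 1.0) for _ in range(2000)]

on_axis = delay_and_sum(simulate(tone, target), target)
off_axis = delay_and_sum(simulate(noise, interferer), target)
gain_db = 10.0 * math.log10(power(noise) / power(off_axis))
print(f"off-axis attenuation: {gain_db:.1f} dB")
```

The on-axis signal adds coherently and passes unchanged, while the off-axis copies arrive misaligned and partially cancel. With a narrowband or low-frequency interferer the attenuation would be smaller than for the broadband noise used in this sketch.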

Figure 6: Performance in a Noisy Environment [axis label: Signal to Noise Ratio (dB)]


As  the  graph  indicates,  there  is  an  appreciable  performance 
gain  using  the  array  for  acoustic  data  collection  in  a  noisy 
environment.  The  simple  beamformer  consistently  scored 
10-15%  higher  than  a  single  microphone  for  SNR’s  less  than 
16dB.  Note  that  in  one  case  the  recognition  result  is  negative. 
This  is  a  consequence  of  the  method  employed  for  calculating 
the  performance  score. 
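
The scoring formula is not given in the text. Assuming the common word-accuracy measure that penalizes insertions as well as substitutions and deletions, a negative score arises whenever the total error count exceeds the number of reference words, as the hypothetical counts below show.

```python
def word_accuracy(n_reference, substitutions, deletions, insertions):
    """Percent accuracy, assumed here to be 100 * (N - S - D - I) / N."""
    errors = substitutions + deletions + insertions
    return 100.0 * (n_reference - errors) / n_reference


# Hypothetical error counts for 100 reference words.
quiet = word_accuracy(n_reference=100, substitutions=15, deletions=5, insertions=3)
noisy = word_accuracy(n_reference=100, substitutions=40, deletions=10, insertions=60)
print(quiet, noisy)
```

In heavy noise a connected-speech recognizer can hypothesize many spurious words, so the insertion count alone may push the score below zero.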


6.  Conclusion 

An  intricate  experiment  has  been  developed  to  quantify 
the effects of alternative acoustic environments on speech-
recognition systems. The performance of an HMM-based
alphadigit recognizer was reduced about 12% when input was
converted  from  high-fidelity,  close-talking  input  to  either  a 
single  remote  microphone  or  the  output  of  a  delay-and-sum 
beamformer  using  an  eight-microphone  linear  array  under 
quiet  conditions.  Beamforming  did  significantly  improve 
performance  over  that  of  a  single  microphone  for  low  signal- 
to-noise  ratios  and  is  thus  advantageous  in  the  presence  of 
acoustic  interference. 

More  importantly,  though,  the  work  establishes  an  automated 
procedure for reconstructing a given database in a new environment,
permitting the evaluation of acoustic-input devices.
Such  a  structured  methodology  has  allowed  the  determination 
of  baseline  performance  and  now  future  improvements  can 
be  appropriately  measured. 